
Workshop Aims to Examine the IsiZulu National Corpus
The University Language Planning and Development Office (ULPDO) hosted a two-day IsiZulu Corpus Workshop that was attended by partners and stakeholders from the KwaZulu-Natal Legislature, the Department of Arts and Culture, eThekwini Municipality, Ukhozi FM, Isichazamazwi SesiZulu, language freelancers, retired language experts, and staff and students from the University of KwaZulu-Natal and the Durban University of Technology.
Facilitated by Professor Elsabe Taljard from the Department of African Languages at the University of Pretoria, the event aimed to give a comprehensive understanding of a corpus, its nature and applications, and to review the IsiZulu National Corpus (INC), a meticulous collection of linguistic data developed by the ULPDO as a national resource.
Welcoming guests, Dean and Head of the School of Arts, Professor Nobuhle Hlongwa, emphasised the importance of a corpus in the intellectualisation of African languages. ‘We are proud as UKZN to have the largest INC, and we will continue to build on it because without the corpus it would be difficult to develop dictionaries,’ she said.
Taljard, who started off the workshop by complimenting the ULPDO as a permanent and sustainable language development and planning office, said, ‘I always use the work done by the ULPDO at UKZN as an example because it’s one of the few universities that seems to grasp the importance of language and terminology development.’ She said that corpus work was a highly specialised field and that a corpus, which consists of running text, differs from a term bank, which contains a list of terms. She defined a corpus as ‘a systematic collection of machine-readable authentic texts which can be sampled to represent a particular language or language variety.’ Identifying the different types of corpora, she said that general corpora comprise huge, reusable data sets, while specialised corpora are domain- or genre-specific and disposable.
She highlighted the INC as a general corpus that can be used in lexicography, terminology, linguistics and human language technologies, adding that the balance of texts, the representativeness of the sample and the sample size are the three closely related factors of a corpus.
She said corpus texts could be sourced publicly through newspapers, journals, magazines and internet sites, while private data included information that was not in the public domain, such as personal networks and/or institutional information. For hard copies to be made part of a corpus, they must be scanned and processed with Optical Character Recognition (OCR) to provide clean data sets.
Taljard described the importance of metadata in providing authentic and accurate data and listed its four categories, namely editorial, analytic, descriptive and administrative. She commented on keywords as a useful way to characterise the text or genre being used in the corpus, allowing for text retrieval and/or term extraction to take place.
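For readers curious about how keywords can characterise a text, the idea can be illustrated with a minimal Python sketch. It is not the method presented at the workshop or used for the INC; it assumes a simple smoothed frequency ratio between a target text and a reference text, and the function names (`tokenize`, `keywords`) and the English toy sentences are hypothetical, chosen purely for illustration.

```python
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    # Naive tokenisation: split on whitespace, strip punctuation, lowercase.
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def keywords(target_text, reference_text, top_n=2):
    # Rank words by how much more frequent they are in the target text
    # than in the reference text (an assumed, simplified keyness score).
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    vocab_size = len(set(target) | set(reference))

    scores = {}
    for word, count in target.items():
        # Add-one smoothing keeps the ratio defined for words
        # that never occur in the reference text.
        p_target = (count + 1) / (target_total + vocab_size)
        p_reference = (reference[word] + 1) / (reference_total + vocab_size)
        scores[word] = p_target / p_reference

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Toy data only: a "specialised" health text and a "general" reference text.
health_text = "The clinic nurse gave the patient a vaccine. The vaccine protects the patient."
general_text = "The children walked to the river. The teacher read the book to the children."

top_keywords = keywords(health_text, general_text)
print(top_keywords)
```

On this toy data the sketch ranks 'patient' and 'vaccine' highest, while a common word such as 'the' scores low because it is frequent in both texts. Corpus tools typically use more robust statistics, but the principle of comparing a text against a reference is the same.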
Discussing the strengths and possible improvements of the INC, she said it was a wonderful resource that was being underutilised and reflected on how access to the platform should be improved in a practical, feasible manner. She urged the ULPDO to make the INC more visible through a promotional campaign that encourages researchers, lexicographers, terminologists and corpus linguists to make use of it. She further implored the office to create a smaller corpus that is theme- or topic-specific, adding value to the overall project. She also underscored the importance of paying close attention to metadata, balance, representativeness and the sample so that the INC can be used as an accurate and prestigious dataset for research and researchers.
Other topics covered in the workshop included Corpus and Lexicography, Corpora and Corpus Linguistics, and Corpora and Terminology.
In his closing remarks, ULPDO Acting Director, Mr Khumbulani Mngadi, noted that the workshop had been developed after observing a lack of interaction with the corpus. Commenting on the workshop conducted by Taljard, he said, ‘The scope you have covered is sufficient for us to start making the necessary changes… as we still have a long way to go in ensuring that isiZulu is one of the languages that is fully intellectualised.’
Mngadi shared ULPDO’s concept of intellectualising African languages through five development pillars: the INC, terminology, research/social cohesion, literature and human language technologies (eg the IsiZulu term bank and spellchecker). He also highlighted the importance of developing computational linguists and terminologists from undergraduate level, seeing the value of creating specialists and experts within the field.
Mngadi thanked all those who had been involved in the development of the INC over the past decade. He specifically thanked the School of Education for its continued support of various ULPDO projects and mentioned that it was not a coincidence that the programme director, Dr Nokukhanya Ngcobo, was from the School. Mngadi also thanked Taljard as the guest speaker, saying, ‘We are grateful for the suggestions you have given us in improving our corpus and plan to implement them, making it more user friendly.’
Words: Hlengiwe Khwela
Photograph: Andile Ndlovu